This repository was archived by the owner on Sep 9, 2025. It is now read-only.

@nimbinatus nimbinatus commented Feb 5, 2025

This proposal discusses how the concept of the taxonomy should change and evolve to meet user needs.

@nimbinatus nimbinatus marked this pull request as draft February 5, 2025 21:51
Signed-off-by: Laura Santamaria <[email protected]>
@nimbinatus nimbinatus force-pushed the taxonomy-revamp-2025 branch from 0287599 to 4ed5233 Compare February 5, 2025 23:02
@nimbinatus nimbinatus marked this pull request as ready for review February 6, 2025 15:26
Contributor

@jwm4 jwm4 left a comment

This basically looks good to me but I had some review comments too.


### Use a schema field rather than directory tree structure

Drop the folder structure in favor of a schema field for submission type and even domain, if necessary. The schema field can be entered automatically via the UI through a user selecting `knowledge` or `skill`.
Contributor

Another thing the directory tree structure lets you do is organize at multiple levels, e.g., `knowledge/animals/reptiles/turtles`. I don't know how important that ability is. I guess we could encode that in the schema field (e.g., have the value be `knowledge/animals/reptiles/turtles`), but I think it would make more sense to have a separate field for this purpose, e.g., the schema could be `knowledge` and the categorization could be `animals/reptiles/turtles`.

If we don't have the ability to nest knowledge and skills into groups and subgroups, then I would say it is not really a taxonomy.
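
For illustration only, here is a minimal sketch of what a separate type field plus a categorization field could look like. The names `TaxonomyEntry`, `submission_type`, `categorization`, and `parse_entry` are hypothetical and do not come from the proposal or from the InstructLab schema.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyEntry:
    """Hypothetical record; field names are illustrative, not the real schema."""
    submission_type: str  # "knowledge" or "skill"
    categorization: list[str] = field(default_factory=list)  # optional nesting

    def path(self) -> str:
        """Rebuild the directory-style path from the two fields."""
        return "/".join([self.submission_type, *self.categorization])


def parse_entry(raw: dict) -> TaxonomyEntry:
    """Build an entry from a dict, splitting a slash-delimited categorization."""
    submission_type = raw["submission_type"]
    if submission_type not in ("knowledge", "skill"):
        raise ValueError(f"unknown submission type: {submission_type!r}")
    # A missing or empty categorization simply yields no nesting.
    parts = [p for p in raw.get("categorization", "").split("/") if p]
    return TaxonomyEntry(submission_type, parts)


entry = parse_entry({"submission_type": "knowledge",
                     "categorization": "animals/reptiles/turtles"})
print(entry.path())
```

Under these assumptions the nesting survives as data, so groups and subgroups remain available without depending on where the file sits on disk.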

Author

I honestly think that's really the question here, and I honestly don't know. I can't find a reference to it being useful in the code, but I may be missing something. Is the whole nested structure really necessary?

Contributor

Here's one reason why it could turn out to be useful eventually: Imagine you have too much data and you want to do some sort of subset selection either before running SDG (generating less data to begin with) or after running SDG (removing some of the data that you just generated). In either case you might want to use the hierarchy to constrain what gets discarded. Maybe you want to ensure the coverage stays as wide as possible, so you would rather discard half of the stuff in animals/reptiles and half of the stuff in animals/insects than discard all of the stuff in one and none of the stuff in the other. Or maybe you want to go the other way: you would rather teach to mastery in some subjects than teach to partial mastery in more subjects, so you would rather discard an entire branch of the taxonomy than discard lots of pieces of lots of different branches. In either case having that structure would be useful.

On the other hand, in general it is not a great idea to include stuff in a schema just because there's a hypothetical argument for why it might be useful someday. So maybe we should put more thought into whether we really do want to do these things in the foreseeable future before making a commitment one way or the other.
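
To make the two subsetting strategies concrete, here is a rough sketch under stated assumptions: each sample carries a slash-delimited category path, and a "branch" is the first two path segments. The function names (`group_by_branch`, `keep_wide`, `keep_deep`) are made up for this example and are not part of SDG.

```python
import random
from collections import defaultdict


def group_by_branch(items, depth=2):
    """Group (category_path, sample) pairs by the first `depth` path segments."""
    groups = defaultdict(list)
    for category, sample in items:
        branch = "/".join(category.split("/")[:depth])
        groups[branch].append(sample)
    return groups


def keep_wide(items, fraction, seed=0, depth=2):
    """Keep roughly `fraction` of every branch, so coverage stays wide."""
    rng = random.Random(seed)
    kept = {}
    for branch, samples in group_by_branch(items, depth).items():
        k = max(1, round(len(samples) * fraction))
        kept[branch] = rng.sample(samples, k)
    return kept


def keep_deep(items, branches_to_keep, depth=2):
    """Keep whole branches and drop every other branch entirely."""
    groups = group_by_branch(items, depth)
    return {b: s for b, s in groups.items() if b in branches_to_keep}


data = [("animals/reptiles/turtles", f"r{i}") for i in range(4)]
data += [("animals/insects/ants", f"i{i}") for i in range(4)]

wide = keep_wide(data, 0.5)                  # half of each branch survives
deep = keep_deep(data, {"animals/reptiles"})  # one branch survives intact
```

Both functions only need the category path as data; whether that path comes from a directory tree or from a metadata field makes no difference to the selection logic.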

Contributor

Oh, here's another related reason: say you are building a large taxonomy (either the community taxonomy or maybe a private internal taxonomy for some customer). Maybe in phase one the taxonomy covers a large variety of topics, and then in phase two it gets so large that you decide what you really want is to split up the taxonomy into pieces so that you can train separate models for each piece. In that case, having the hierarchy could be super useful because you've already done the work to split it up into pieces as you added stuff to the taxonomy, instead of waiting until you want to do the splitting, at which point you have this huge undifferentiated blob that you need to deal with. That might be a better argument for keeping the hierarchy than the one that I mentioned above around subsetting.

Contributor

File directory structure could end up being metadata on ingested chunks as well, which could be useful in multiple ways.

Contributor

Would it work if we allow but not require the directory structure, while making sure the filepath is part of metadata? Then it's up to the user to decide how they want to organize their files.
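
As a sketch of the "allow but not require" idea, the snippet below derives metadata from whatever relative path a file happens to have. The function name `metadata_from_path` and the metadata keys are assumptions for illustration, not an existing InstructLab API.

```python
from pathlib import PurePosixPath


def metadata_from_path(relative_path: str) -> dict:
    """Derive metadata from a file path without requiring any nesting."""
    path = PurePosixPath(relative_path)
    return {
        "filepath": str(path),
        "filename": path.name,
        # Directory names, if the user chose to nest; an empty list if not.
        "categories": list(path.parent.parts),
    }


nested = metadata_from_path("science/biology/ornithology/qna.yaml")
flat = metadata_from_path("qna.yaml")
```

A flat layout and a deeply nested one both produce valid metadata here, so the choice of organization stays with the user.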

Author

I think so. The important thing, from the code perspective, is that the actual directory names and nested tree structure are not really used other than to name temporary files and identify where there are new files. I've dug through the codebase for SDG. A user can organize their files in a taxonomy like science > biology > ornithology or in one like documents > reports > lab, and the SDG process doesn't know the difference. A taxonomy with knowledge categorization is a human construct for human use, and I don't think we should define that for the end user if we want this to be generalizable to any end user's system. We can encourage sharing that information with us through metadata, sure, but it's not something we use otherwise.

In short, I think maintaining a document store and version control for a user, whether forcing a specific way to do it or providing a system ourselves, is outside of our remit with the project.


### Switch to JSON and Markdown for the `qna.yaml` document

Allow the user to use Markdown in a WYSIWYG experience, and then use a Markdown-to-JSON converter to handle the conversion to a code-friendly format.
Contributor

I am a little worried that this would lead to a situation where it is easy to write Markdown that looks good but is not usable by SDG after it goes through the Markdown-to-JSON converter. However, I am not sure how big a deal that would be. What might help provide some intuition is an example of what such Markdown would look like; that might give more of a sense of how hard it is to write SDG-compliant Markdown.

Contributor

(Also, FWIW, I agree with the premise here that YAML is a big part of the problem with our current taxonomy format.)


## Unaddressed concerns

The issue of needing a git repository for document storage is possibly out of scope of this document. However, I'm adding it as something that may need its own ADR/dev doc. The end user experience of needing a git repository is needlessly complex and also still follows the idea of the upstream taxonomy and community model build. A user working with InstructLab locally does not need the version tracking provided by git and likely already has a document storage system. I propose changing the general idea from a git repository to a simple address, whether that's local storage, remote storage, or a version-controlled repository. Make it more flexible.
Contributor

Under "Streamline the schema", you proposed "Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone." I feel we could do the same for the git repository issue: i.e., community submissions are required to link to git and provide commit hashes but that's enforced in the community repo not in InstructLab.

Author

Agreed. I have that in the Unaddressed Concerns section as I'm not sure whether that's in scope for this specific change. Which team owns that part of the process isn't quite clear, so I didn't want to go tromping on toes...

Contributor

I guess I would prefer to go the other way: Include the git stuff in this proposal and then see if anyone pushes back. If they have a good argument for keeping the git enforcement in InstructLab (instead of enforcing it at the community level), they can make it.

@nimbinatus nimbinatus changed the title Taxonomy revamp 2025 Enhancement Proposal: Taxonomy Updates Feb 6, 2025
Member

@alinaryan alinaryan left a comment

These are great ideas! I would also suggest addressing the git dependency, upstream taxonomy, and community model builds in this doc. Alternatively, opening dev docs simultaneously to this one could be helpful, because each idea informs the other.


The user experience of working with the `qna.yaml` file is poor for a handful of reasons:

- Many of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy.
Member

can you reference an example of this?

Contributor

I read this as referring to the stuff in the "Streamline the schema" section below; if that's what is meant, then maybe a note like "See 'Streamline the schema' below" would be helpful here.

@nimbinatus
Author

> I would also suggest addressing the git dependency, upstream taxonomy, and community model builds in this doc. Alternatively, opening dev docs simultaneously to this one could be helpful, because each idea informs the other.

I think the questions of what to do with the upstream taxonomy and community model build are separate situations and not relevant to this document; there are references to the difference in a few places:

> Our taxonomy tree structure and knowledge/skill file structure was designed with upstream taxonomy submissions in mind. An end user working with a taxonomy locally using InstructLab has to follow all of those requirements, increasing complexity of their work.

and

> The end user gets hung up on where to place the file in a massive file tree where sub-branches are not defined. For someone not working in the upstream taxonomy, this is basically bikeshedding[^1]. The only requirement for the SDG process is sorting things into knowledge and skills.

and

> Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone.

are some examples.

For the git dependency, I addressed an initial thought to solving that problem in the unaddressed concerns section, as I mentioned to Bill above. If there is a consensus that I fold that into this document, though, I am happy to do so.


Write documentation and tutorials based on existing tutorials on writing reading comprehension questions and example answers for standardized exams.

Most people can understand reading to learn versus learning to read type questions. The new, streamlined schema that matches the most simple needs could help here along with a solid set of docs and tutorials on how to write reading comprehension sets. We could borrow heavily from the standard tutorials for writing standardized exams that are out there for free and already battle-tested.
Contributor

Question answering and reading comprehension are different tasks. Reading comprehension is about pulling information given a text, while question answering can involve reasoning / synthesis when a question is not concretely answered by some piece of text.

Is training on reading comprehension questions enough? @jwm4, thoughts?


## Unaddressed concerns

The issue of needing a git repository for document storage is possibly out of scope of this document. However, I'm adding it as something that may need its own ADR/dev doc. The end user experience of needing a git repository is needlessly complex and also still follows the idea of the upstream taxonomy and community model build. A user working with InstructLab locally does not need the version tracking provided by git and likely already has a document storage system. I propose changing the general idea from a git repository to a simple address, whether that's local storage, remote storage, or a version-controlled repository. Make it more flexible and extensible to match where someone chooses to store their data, perhaps, as one implementation example, through an environment variable. This could also decouple the documentation process from the SDG process by allowing the end-user subject-matter expert to create and upload content to a central store without ever touching InstructLab's tooling chains, and then an end-user operations or development specialist to run the InstructLab tooling separately.
Contributor

I would call out that one thing that git gives is a built-in data provenance system: change tracking, change attribution, and modification dates (consider enterprise use cases where you might have conflicting texts and want to pick the more recent ones, such as HR policy documents).

I agree that dependence on git is best removed, but I think it is important not to lose data provenance. This will also become important when we want to robustly handle document updates and re-ingestion.

Author

@nimbinatus nimbinatus Feb 7, 2025

I suspect most end users already have their own system, whether it solves the data provenance problem or not, and are reluctant to add another to their stack. It's much easier to use, say, the versioning available on Azure for a document rather than teach their end users about git, and we don't have to maintain the system for them as an open source project.
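
The "simple address" idea from the quoted section could be sketched roughly as follows. The environment variable name `TAXONOMY_DOCS_ADDRESS`, the default path, and both function names are hypothetical; InstructLab does not define them. The classification rules are deliberately simplistic and only meant to show the shape of the idea.

```python
import os
from urllib.parse import urlparse

# Hypothetical variable name; InstructLab does not define it.
ENV_VAR = "TAXONOMY_DOCS_ADDRESS"


def classify_address(address: str) -> str:
    """Classify a document-store address as 'git', 'remote', or 'local'."""
    parsed = urlparse(address)
    if address.endswith(".git") or parsed.scheme in ("git", "ssh"):
        return "git"
    if parsed.scheme in ("http", "https", "s3"):
        return "remote"
    # Anything else (relative or absolute paths) is treated as local storage.
    return "local"


def resolve_address(default: str = "./documents") -> tuple[str, str]:
    """Read the address from the environment, falling back to a local default."""
    address = os.environ.get(ENV_VAR, default)
    return address, classify_address(address)


print(resolve_address())
```

Note that this sketch says nothing about provenance: a `local` or `remote` store would need its own versioning (for example, the storage provider's) if change tracking matters.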

The user experience of working with the `qna.yaml` file is poor for a handful of reasons:

- Many of the fields in the `qna.yaml` file are unnecessary save for use in the upstream taxonomy.
- YAML is a notoriously complex, loose format with a lot of potholes. As a couple of examples:
Contributor

👍

@nimbinatus
Author

/hold

Comments and reviews are welcome; I just want to be sure the SDG team gets time to review this :)


Make `created_by`, `domain`, and `document_outline` optional fields. Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone.

### Switch to JSON and Markdown for the `qna.yaml` document
Member

The issue with using markdown in our projects is that often times you want to include markdown in your content, so doing it this way will break the parsing without having a hacky solution.

JSON I like but it's not easy for humans. I do think we should support it, but I wouldn't rely on it as a primary "user-facing" format.

Author

For using Markdown, the idea is a simple user-friendly writing format outside of the UI (which would be the happy path in my mind). Since we already require the markdown to be parsed when it gets taken into the SDG process, I think it makes sense to let the user see the output rendered, allowing them some generic ability to understand whether they converted things correctly.

JSON is the transport format, basically. An end user writing seed examples would not see it unless they explicitly choose to skip the markdown/UI formats and write it themselves. I would imagine that someone making that decision knows how to use it. But this allows us to add a programmatic guardrail in the conversion process to handle any and all translation layers, and I think that the conversion could be just as well handled by Docling as anything else, further reducing our dependency footprint and attack surface.

Member

@RobotSail RobotSail Feb 8, 2025

@nimbinatus I appreciate your thorough response; however, I'm still having a hard time understanding your idea around Markdown. Could you please provide an example of what you're thinking of? I think that will help clear up my confusion.

I agree with your point around JSON, that makes sense 👍
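
No concrete Markdown layout is given in this thread, so the following is only a guess at what one could look like, offered to make the discussion more tangible. It assumes a purely hypothetical convention in which `## Context`, `## Question`, and `## Answer` headings delimit fields; the parser and the `markdown_to_records` name are illustrative and are not the proposal's converter or Docling.

```python
import json
import re

# Hypothetical convention: these level-2 headings delimit fields.
HEADING = re.compile(r"^## (Context|Question|Answer)\s*$")


def markdown_to_records(text: str) -> list[dict]:
    """Convert heading-delimited Markdown into a list of JSON-ready dicts."""
    records, current, key = [], {}, None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            key = match.group(1).lower()
            if key in current:  # a repeated heading starts a new record
                records.append(current)
                current = {}
            current[key] = []
        elif key is not None:
            current[key].append(line)
    if current:
        records.append(current)
    return [{k: "\n".join(v).strip() for k, v in r.items()} for r in records]


sample = """## Question
What is a turtle?

## Answer
A reptile with a shell.

## Question
Do turtles have teeth?

## Answer
No, they have beaks.
"""

records = markdown_to_records(sample)
print(json.dumps(records, indent=2))
```

A naive parser like this also shows the concern raised above: if an answer itself contains a line such as `## Answer`, the content is split into an extra record, so a real converter would need escaping rules or a stricter structure.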


Users who decide to build it without needing the converter are likely familiar with JSON, and there are fewer pitfalls and less likelihood of tooling choices impacting meaning (e.g., where line breaks are for paragraph structures) as the JSON standard has not changed since 2017, and barely changed from the original standard.

### Reframe the Q&A writing process as a reading comprehension process
Member

I think this is good for knowledge, I believe @abhi1092 is actually working on something along the lines of using natural concepts from reading comprehension for the Knowledge 1.5 pipeline.

For skills training though it's probably a bit different since there we want the model to learn how to transform and permute different data.

Member

@RobotSail RobotSail left a comment

A few points but I like the overall idea

@nimbinatus
Author

nimbinatus commented Feb 7, 2025

Note: I've been given new information about the taxonomy file structure. I may be updating this document soon, but please still leave me comments.


### Streamline the schema

Make `created_by`, `domain`, and `document_outline` optional fields. Enforce those fields for the upstream taxonomy though documentation, CI, and review rather than require them for everyone.
Contributor

nit: though -> through
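
One way to read the "Streamline the schema" idea quoted above is a validator that treats `created_by`, `domain`, and `document_outline` as required only in an upstream mode, such as a CI check on the community repository. The sketch below is an assumption-laden illustration: the `validate` function, the `upstream` flag, and the choice of `seed_examples` as an always-required field are not taken from the proposal.

```python
# Fields the proposal would make optional for local use.
UPSTREAM_ONLY_FIELDS = ("created_by", "domain", "document_outline")
# Assumption for illustration: some minimal set is always required.
ALWAYS_REQUIRED_FIELDS = ("seed_examples",)


def validate(qna: dict, upstream: bool = False) -> list[str]:
    """Return a list of missing-field errors; an empty list means valid."""
    required = ALWAYS_REQUIRED_FIELDS + (UPSTREAM_ONLY_FIELDS if upstream else ())
    return [f"missing required field: {name}" for name in required if name not in qna]


local_doc = {"seed_examples": []}
print(validate(local_doc))                 # local use: upstream-only fields are optional
print(validate(local_doc, upstream=True))  # upstream CI: those fields are enforced
```

The same document passes locally and fails the stricter upstream check, which keeps the enforcement in the community repository rather than in the tool itself.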
